{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Spacy\n",
    "For the features of the words we're going to use spacy. Once we tokenize the text, we can access features like Part of speech and others."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Displaying progress\n",
    "tqdm is a progress bar. It will come usefeful when we extract the answers for all the articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00,  1.97it/s]\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "for i in tqdm(range(10)):\n",
    "    time.sleep(0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pickling\n",
    "Once we extract all the words from the texts, we'll save them using pickle. Then we can easily use them in the other modules and have to wait for them to generat again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import _pickle as cPickle\n",
    "from pathlib import Path\n",
    "\n",
    "def dumpPickle(fileName, content):\n",
    "    pickleFile = open(fileName, 'wb')\n",
    "    cPickle.dump(content, pickleFile, -1)\n",
    "    pickleFile.close()\n",
    "\n",
    "def loadPickle(fileName):    \n",
    "    file = open(fileName, 'rb')\n",
    "    content = cPickle.load(file)\n",
    "    file.close()\n",
    "    \n",
    "    return content\n",
    "    \n",
    "def pickleExists(fileName):\n",
    "    file = Path(fileName)\n",
    "    \n",
    "    if file.is_file():\n",
    "        return True\n",
    "    \n",
    "    return False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = pd.read_json('../data/squad-v1/train-v1.1.json', orient='column')\n",
    "dev = pd.read_json('../data/squad-v1/dev-v1.1.json', orient='column')\n",
    "\n",
    "df = pd.concat([train, dev], ignore_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>data</th>\n",
       "      <th>version</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>{'title': 'University_of_Notre_Dame', 'paragra...</td>\n",
       "      <td>1.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>{'title': 'Beyoncé', 'paragraphs': [{'context'...</td>\n",
       "      <td>1.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>{'title': 'Montana', 'paragraphs': [{'context'...</td>\n",
       "      <td>1.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>{'title': 'Genocide', 'paragraphs': [{'context...</td>\n",
       "      <td>1.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>{'title': 'Antibiotics', 'paragraphs': [{'cont...</td>\n",
       "      <td>1.1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                data  version\n",
       "0  {'title': 'University_of_Notre_Dame', 'paragra...      1.1\n",
       "1  {'title': 'Beyoncé', 'paragraphs': [{'context'...      1.1\n",
       "2  {'title': 'Montana', 'paragraphs': [{'context'...      1.1\n",
       "3  {'title': 'Genocide', 'paragraphs': [{'context...      1.1\n",
       "4  {'title': 'Antibiotics', 'paragraphs': [{'cont...      1.1"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting words from a paragrapgh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "currText = df['data'][0]['paragraphs'][0]['context']\n",
    "currQas = df['data'][0]['paragraphs'][0]['qas']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "currDoc = nlp(currText)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Extract answers and the sentence they are in\n",
    "def extractAnswers(qas, doc):\n",
    "    answers = []\n",
    "\n",
    "    senStart = 0\n",
    "    senId = 0\n",
    "\n",
    "    for sentence in doc.sents:\n",
    "        senLen = len(sentence.text)\n",
    "\n",
    "        for answer in qas:\n",
    "            answerStart = answer['answers'][0]['answer_start']\n",
    "\n",
    "            if (answerStart >= senStart and answerStart < (senStart + senLen)):\n",
    "                answers.append({'sentenceId': senId, 'text': answer['answers'][0]['text']})\n",
    "\n",
    "        senStart += senLen\n",
    "        senId += 1\n",
    "    \n",
    "    return answers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'sentenceId': 1, 'text': 'a golden statue of the Virgin Mary'},\n",
       " {'sentenceId': 2, 'text': 'a copper statue of Christ'},\n",
       " {'sentenceId': 3, 'text': 'the Main Building'},\n",
       " {'sentenceId': 4, 'text': 'a Marian place of prayer and reflection'},\n",
       " {'sentenceId': 5, 'text': 'Saint Bernadette Soubirous'}]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "currAnswers = extractAnswers(currQas, currDoc)\n",
    "currAnswers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "#TODO - Clean answers from stopwords?\n",
    "def tokenIsAnswer(token, sentenceId, answers):\n",
    "    for i in range(len(answers)):\n",
    "        if (answers[i]['sentenceId'] == sentenceId):\n",
    "            if (answers[i]['text'] == token):\n",
    "                return True\n",
    "    return False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenIsAnswer('the Main Building', 4, currAnswers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Save named entities start points\n",
    "\n",
    "def getNEStartIndexs(doc):\n",
    "    neStarts = {}\n",
    "    for ne in doc.ents:\n",
    "        neStarts[ne.start] = ne\n",
    "        \n",
    "    return neStarts "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NORP\n"
     ]
    }
   ],
   "source": [
    "currNeStarts = getNEStartIndexs(currDoc)\n",
    "\n",
    "if 6 in currNeStarts:\n",
    "    print(currNeStarts[6].label_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getSentenceStartIndexes(doc):\n",
    "    senStarts = []\n",
    "    \n",
    "    for sentence in doc.sents:\n",
    "        senStarts.append(sentence[0].i)\n",
    "    \n",
    "    return senStarts\n",
    "    \n",
    "def getSentenceForWordPosition(wordPos, senStarts):\n",
    "    for i in range(1, len(senStarts)):\n",
    "        if (wordPos < senStarts[i]):\n",
    "            return i - 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0, 9, 25, 55, 68, 84, 108]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "senStarts = getSentenceStartIndexes(currDoc)\n",
    "senStarts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "getSentenceForWordPosition(108, senStarts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>isAnswer</th>\n",
       "      <th>titleId</th>\n",
       "      <th>paragrapghId</th>\n",
       "      <th>sentenceId</th>\n",
       "      <th>wordCount</th>\n",
       "      <th>NER</th>\n",
       "      <th>POS</th>\n",
       "      <th>TAG</th>\n",
       "      <th>DEP</th>\n",
       "      <th>shape</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [text, isAnswer, titleId, paragrapghId, sentenceId, wordCount, NER, POS, TAG, DEP, shape]\n",
       "Index: []"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Creating the dataframe\n",
    "wordColums = ['text', 'isAnswer', 'titleId', 'paragrapghId', 'sentenceId','wordCount', 'NER', 'POS', 'TAG', 'DEP','shape']\n",
    "wordDf = pd.DataFrame(columns=wordColums)\n",
    "\n",
    "#Save to pickle\n",
    "\n",
    "#load df\n",
    "\n",
    "#Add new words to array\n",
    "newWord = ['koala', True, 0, 0, 4, 1, None, None, None, None, 'xxxxx']\n",
    "newWords = []\n",
    "#newWords.append(newWord)\n",
    "\n",
    "#Make array to dataframe\n",
    "newWordsDf = pd.DataFrame(newWords, columns=wordColums)\n",
    "newWordsDf\n",
    "\n",
    "#Merge dataframes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def addWordsForParagrapgh(newWords, titleId, paragraphId):\n",
    "    text = df['data'][titleId]['paragraphs'][paragraphId]['context']\n",
    "    qas = df['data'][titleId]['paragraphs'][paragraphId]['qas']\n",
    "\n",
    "    doc = nlp(text)\n",
    "\n",
    "    answers = extractAnswers(qas, doc)\n",
    "    neStarts = getNEStartIndexs(doc)\n",
    "    senStarts = getSentenceStartIndexes(doc)\n",
    "    \n",
    "    #index of word in spacy doc text\n",
    "    i = 0\n",
    "    \n",
    "    while (i < len(doc)):\n",
    "        #If the token is a start of a Named Entity, add it and push to index to end of the NE\n",
    "        if (i in neStarts):\n",
    "            word = neStarts[i]\n",
    "            #add word\n",
    "            currentSentence = getSentenceForWordPosition(word.start, senStarts)\n",
    "            wordLen = word.end - word.start\n",
    "            shape = ''\n",
    "            for wordIndex in range(word.start, word.end):\n",
    "                shape += (' ' + doc[wordIndex].shape_)\n",
    "\n",
    "            newWords.append([word.text,\n",
    "                            tokenIsAnswer(word.text, currentSentence, answers),\n",
    "                            titleId,\n",
    "                            paragraphId,\n",
    "                            currentSentence,\n",
    "                            wordLen,\n",
    "                            word.label_,\n",
    "                            None,\n",
    "                            None,\n",
    "                            None,\n",
    "                            shape])\n",
    "            i = neStarts[i].end - 1\n",
    "        #If not a NE, add the word if it's not a stopword or a non-alpha (not regular letters)\n",
    "        else:\n",
    "            if (doc[i].is_stop == False and doc[i].is_alpha == True):\n",
    "                word = doc[i]\n",
    "\n",
    "                currentSentence = getSentenceForWordPosition(i, senStarts)\n",
    "                wordLen = 1\n",
    "\n",
    "                newWords.append([word.text,\n",
    "                                tokenIsAnswer(word.text, currentSentence, answers),\n",
    "                                titleId,\n",
    "                                paragraphId,\n",
    "                                currentSentence,\n",
    "                                wordLen,\n",
    "                                None,\n",
    "                                word.pos_,\n",
    "                                word.tag_,\n",
    "                                word.dep_,\n",
    "                                word.shape_])\n",
    "        i += 1\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "#TODO For each token add, for each NE add... "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newWords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "addWordsForParagrapgh(newWords, 0, 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Architecturally', False, 0, 0, 0, 1, None, 'ADV', 'RB', 'advmod', 'Xxxxx']"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newWords[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>isAnswer</th>\n",
       "      <th>titleId</th>\n",
       "      <th>paragrapghId</th>\n",
       "      <th>sentenceId</th>\n",
       "      <th>wordCount</th>\n",
       "      <th>NER</th>\n",
       "      <th>POS</th>\n",
       "      <th>TAG</th>\n",
       "      <th>DEP</th>\n",
       "      <th>shape</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Architecturally</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "      <td>ADV</td>\n",
       "      <td>RB</td>\n",
       "      <td>advmod</td>\n",
       "      <td>Xxxxx</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>school</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>NN</td>\n",
       "      <td>nsubj</td>\n",
       "      <td>xxxx</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Catholic</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>NORP</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>Xxxxx</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>character</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>NN</td>\n",
       "      <td>dobj</td>\n",
       "      <td>xxxx</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Atop</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "      <td>ADP</td>\n",
       "      <td>IN</td>\n",
       "      <td>prep</td>\n",
       "      <td>Xxxx</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              text  isAnswer  titleId  paragrapghId  sentenceId  wordCount  \\\n",
       "0  Architecturally     False        0             0         0.0          1   \n",
       "1           school     False        0             0         0.0          1   \n",
       "2         Catholic     False        0             0         0.0          1   \n",
       "3        character     False        0             0         0.0          1   \n",
       "4             Atop     False        0             0         1.0          1   \n",
       "\n",
       "    NER   POS   TAG     DEP   shape  \n",
       "0  None   ADV    RB  advmod   Xxxxx  \n",
       "1  None  NOUN    NN   nsubj    xxxx  \n",
       "2  NORP  None  None    None   Xxxxx  \n",
       "3  None  NOUN    NN    dobj    xxxx  \n",
       "4  None   ADP    IN    prep    Xxxx  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newWordsDf = pd.DataFrame(newWords, columns=wordColums)\n",
    "newWordsDf.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>isAnswer</th>\n",
       "      <th>titleId</th>\n",
       "      <th>paragrapghId</th>\n",
       "      <th>sentenceId</th>\n",
       "      <th>wordCount</th>\n",
       "      <th>NER</th>\n",
       "      <th>POS</th>\n",
       "      <th>TAG</th>\n",
       "      <th>DEP</th>\n",
       "      <th>shape</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>the Main Building</td>\n",
       "      <td>True</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3</td>\n",
       "      <td>FAC</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>xxx Xxxx Xxxxx</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>Saint Bernadette Soubirous</td>\n",
       "      <td>True</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3</td>\n",
       "      <td>PERSON</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>Xxxxx Xxxxx Xxxxx</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          text  isAnswer  titleId  paragrapghId  sentenceId  \\\n",
       "22           the Main Building      True        0             0         3.0   \n",
       "38  Saint Bernadette Soubirous      True        0             0         5.0   \n",
       "\n",
       "    wordCount     NER   POS   TAG   DEP               shape  \n",
       "22          3     FAC  None  None  None      xxx Xxxx Xxxxx  \n",
       "38          3  PERSON  None  None  None   Xxxxx Xxxxx Xxxxx  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newWordsDf[newWordsDf['isAnswer'] == True].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generating a words for 2 titles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.84s/it]\n"
     ]
    }
   ],
   "source": [
    "words = []\n",
    "\n",
    "#titlesCount = len(df['data'])\n",
    "titlesCount = 2\n",
    "\n",
    "for titleId in tqdm(range(titlesCount)):\n",
    "    paragraphsCount = len(df['data'][titleId]['paragraphs'])\n",
    "        \n",
    "    for paragraphId in range(paragraphsCount):\n",
    "        addWordsForParagrapgh(words, titleId, paragraphId)\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>isAnswer</th>\n",
       "      <th>titleId</th>\n",
       "      <th>paragrapghId</th>\n",
       "      <th>sentenceId</th>\n",
       "      <th>wordCount</th>\n",
       "      <th>NER</th>\n",
       "      <th>POS</th>\n",
       "      <th>TAG</th>\n",
       "      <th>DEP</th>\n",
       "      <th>shape</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Architecturally</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "      <td>ADV</td>\n",
       "      <td>RB</td>\n",
       "      <td>advmod</td>\n",
       "      <td>Xxxxx</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>school</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>NN</td>\n",
       "      <td>nsubj</td>\n",
       "      <td>xxxx</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Catholic</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>NORP</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>Xxxxx</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>character</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>NN</td>\n",
       "      <td>dobj</td>\n",
       "      <td>xxxx</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Atop</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "      <td>ADP</td>\n",
       "      <td>IN</td>\n",
       "      <td>prep</td>\n",
       "      <td>Xxxx</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              text  isAnswer  titleId  paragrapghId  sentenceId  wordCount  \\\n",
       "0  Architecturally     False        0             0         0.0          1   \n",
       "1           school     False        0             0         0.0          1   \n",
       "2         Catholic     False        0             0         0.0          1   \n",
       "3        character     False        0             0         0.0          1   \n",
       "4             Atop     False        0             0         1.0          1   \n",
       "\n",
       "    NER   POS   TAG     DEP   shape  \n",
       "0  None   ADV    RB  advmod   Xxxxx  \n",
       "1  None  NOUN    NN   nsubj    xxxx  \n",
       "2  NORP  None  None    None   Xxxxx  \n",
       "3  None  NOUN    NN    dobj    xxxx  \n",
       "4  None   ADP    IN    prep    Xxxx  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wordsDf = pd.DataFrame(words, columns=wordColums)\n",
    "wordsDf.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total words for 2 articles: 8678\n"
     ]
    }
   ],
   "source": [
    "print(\"Total words for 2 articles:\", len(wordsDf))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating the entire word dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pickle found. Saved some time.\n"
     ]
    }
   ],
   "source": [
    "wordPickleName = '../data/pickles/wordsDf.pkl'\n",
    "\n",
    "#If the dataframe is already generated, load it.\n",
    "if (pickleExists(wordPickleName)):\n",
    "    print(\"Pickle found. Saved some time.\")\n",
    "    wordsDf = loadPickle(wordPickleName)\n",
    "else:\n",
    "    #Extracting words\n",
    "    words = []\n",
    "\n",
    "#     titlesCount = len(df['data'])   \n",
    "    titlesCount = 2   \n",
    "\n",
    "    for titleId in tqdm(range(titl8esCount)):\n",
    "        paragraphsCount = len(df['data'][titleId]['paragraphs'])\n",
    "\n",
    "#         printProgress(titleId, titlesCount - 1)\n",
    "\n",
    "        for paragraphId in range(paragraphsCount):\n",
    "            addWordsForParagrapgh(words, titleId, paragraphId)\n",
    "    \n",
    "    #Create the dataframe\n",
    "    wordColums = ['text', 'isAnswer', 'titleId', 'paragrapghId', 'sentenceId','wordCount', 'NER', 'POS', 'TAG', 'DEP','shape']\n",
    "    wordsDf = pd.DataFrame(words, columns=wordColums)\n",
    "    \n",
    "    #Pickle the result\n",
    "    dumpPickle(wordPickleName, wordsDf)\n",
    "    print(\"Result was not pickled. You had to wait.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Total extracted words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total words for all articles: 8678\n"
     ]
    }
   ],
   "source": [
    "print(\"Total words for all articles:\", len(wordsDf))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check what percentage of the extracted words are answers in the dataframe. They should be pretty low"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "383 total answers 4.41% of all words are answers.\n"
     ]
    }
   ],
   "source": [
    "totalAnswers = len(wordsDf[wordsDf['isAnswer'] == True])\n",
    "print(totalAnswers, 'total answers', '{:.2f}%'.format((totalAnswers / len(wordsDf)) * 100), 'of all words are answers.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
